Python Web Scraping

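In this chapter you will learn several modules that make it easy to scrape web pages in Python.
webbrowser: Ships with Python; opens a browser to a specified page.
requests: Downloads files and web pages from the Internet.
Beautiful Soup: Parses HTML, the format web pages are written in.
selenium: Launches and controls a web browser. It can fill in forms and simulate mouse clicks in the browser.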

01-mapIt.py: using the webbrowser module

The webbrowser module's open() function launches a new browser and opens the specified URL.

import webbrowser
webbrowser.open('https://ditu.amap.com/')

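I. Figure out the URL

An example Amap (高德地图) URL

To search for a place, you need to know the format of the map site's URLs and how to use that format to search quickly.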

https://ditu.amap.com/search?query=千山半岛

About the URL

First, work out which URL to use for a given street address: visit
'https://ditu.amap.com/search?query=千山半岛'
(where 千山半岛 is the address whose map you want to see).
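
Addresses often contain spaces or non-ASCII characters, which should be percent-encoded before they are placed in a query string. The following is a minimal sketch using the standard library's urllib.parse. The names build_search_url and AMAP_SEARCH are invented for this illustration, and it assumes the site accepts a standard percent-encoded query value.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Base search URL taken from the example above.
AMAP_SEARCH = 'https://ditu.amap.com/search'


def build_search_url(address):
    """Return the search URL for address, percent-encoding the query value."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return AMAP_SEARCH + '?' + urlencode({'query': address}, quote_via=quote)


url = build_search_url('千山半岛')
print(url)

# Decoding the query string gives back the original address.
print(parse_qs(urlsplit(url).query)['query'][0])
```

Passing the result to webbrowser.open() works the same way as passing the hand-written URL.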

II. Handle the command-line arguments

mapIt.py

#! python3
# mapIt.py - Launches amap in the browser using an address from the
# command line or clipboard.
import pyperclip
import webbrowser
import sys

if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()
webbrowser.open('https://ditu.amap.com/search?query=' + address)
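
The way the script chooses its input can be checked in isolation by moving the if/else branch into a function. This is only a sketch: get_address is a hypothetical helper, and read_clipboard stands for any zero-argument callable, such as pyperclip.paste.

```python
def get_address(argv, read_clipboard):
    """Return the address from argv if one was given, else from the clipboard."""
    if len(argv) > 1:
        # argv[0] is the script name; the remaining items are the address words.
        return ' '.join(argv[1:])
    return read_clipboard()


# Simulated command line: the words after the script name are joined with spaces.
from_args = get_address(['mapIt.py', '870', 'Valencia', 'St'], lambda: 'unused')
# No arguments: the clipboard reader is called instead.
from_clipboard = get_address(['mapIt.py'], lambda: '千山半岛')

print(from_args)
print(from_clipboard)
```

In the real script the call would look like get_address(sys.argv, pyperclip.paste).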

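Table 11-1. Getting a map with and without mapIt.py

Manually getting a map	Using mapIt.py
Highlight the address	Highlight the address
Copy the address	Copy the address
Open the web browser	Run mapIt.py
Open `https://ditu.amap.com`
Click the address text field
Paste the address
Press Enter

This program makes an everyday task less tedious.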

III. Handle the clipboard content and launch the browser

search.py

#! python3
# search.py - Launches a Bing search in the browser using text from the
# command line or clipboard.
import pyperclip
import webbrowser
import sys

if len(sys.argv) > 1:
    # Get the search text from the command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get the search text from the clipboard.
    address = pyperclip.paste()
# Map search from mapIt.py, disabled in this variant:
# webbrowser.open('https://ditu.amap.com/search?query=' + address)

webbrowser.open('https://cn.bing.com/search?q=' + address)

03-Downloading files with the requests module

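1. The requests module lets you download files from the web easily, without worrying about complicated issues such as network errors, connection problems, and data compression.

2. The requests module does not ship with Python, so you must install it first. From the command line, run pip install requests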

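I. Download a web page with the requests.get() function

1. Code overview

1. The requests.get() function takes a string containing the URL to download.

2. Enter the following code in the interactive shell, keeping your computer connected to the Internet: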

import requests

res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
try:
    # Raise an exception if the download failed (4xx or 5xx status).
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
else:
    # Save the response in binary mode, 100,000 bytes at a time.
    with open('RomeoAndJuliet.txt', 'wb') as playFile:
        for chunk in res.iter_content(100000):
            playFile.write(chunk)

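3. The URL points to a plain-text page containing the entire play Romeo and Juliet, provided by Project Gutenberg.

4. If the request succeeds, the downloaded page is stored as a string in the Response object's text attribute. This attribute holds one large string containing the whole play; calling len(res.text) shows that it is more than 178,000 characters long.
5. Calling print(res.text[:250]) displays the first 250 characters.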
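
The chunked write used in the download listing can be sketched as a small reusable function. FakeResponse is a hypothetical stand-in that only mimics iter_content(), so the example runs without a network connection; with the real library you would pass the object returned by requests.get(). The name save_response is likewise invented for this illustration.

```python
import os
import tempfile


def save_response(res, path, chunk_size=100000):
    """Write res to path in binary mode, chunk by chunk; return bytes written."""
    written = 0
    with open(path, 'wb') as out_file:
        for chunk in res.iter_content(chunk_size):
            written += out_file.write(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in that mimics Response.iter_content()."""

    def __init__(self, content):
        self.content = content

    def iter_content(self, chunk_size):
        # Yield the content in slices of at most chunk_size bytes.
        for start in range(0, len(self.content), chunk_size):
            yield self.content[start:start + chunk_size]


demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
demo_size = save_response(FakeResponse(b'x' * 250000), demo_path)
print(demo_size)
```

Writing in chunks keeps memory use bounded by chunk_size on the writing side, which matters more as downloads grow larger.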

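2. Checking for errors

The raise_for_status() method raises an exception if the download failed. You can wrap the raise_for_status() line in try and except statements to handle this error without crashing the program.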

try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
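
One way to package this pattern is a function that returns None when the download fails, so the caller can decide what to do next. This is a sketch under assumptions: fetch_text is a hypothetical helper, its get parameter is any callable that behaves like requests.get, and FailingResponse is a stand-in used only so the example runs offline.

```python
def fetch_text(url, get):
    """Return the page text, or None if the download failed.

    get is any callable that behaves like requests.get (an assumption
    made so this sketch can run without the network).
    """
    try:
        res = get(url)
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        return None
    return res.text


class FailingResponse:
    """Hypothetical stand-in for a response with an error status."""
    text = ''

    def raise_for_status(self):
        raise RuntimeError('404 Client Error')


result = fetch_text('http://example.invalid/missing', lambda url: FailingResponse())
print(result)
```

With the real library the call would look like fetch_text(url, requests.get).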

04-Parsing HTML

For more details, see the HTML chapter.